Abstract: In deterministic systems, reinforcement learning-
based online approximate optimal control methods typically
require a restrictive persistence of excitation (PE) condition 
for convergence. This paper presents a concurrent learning- 
based solution to the online approximate optimal regulation 
problem that eliminates the need for PE. The development is
based on the observation that given a model of the system, 
the Bellman error, which quantifies the deviation of the system 
Hamiltonian from the optimal Hamiltonian, can be evaluated at
any point in the state space. Further, a concurrent learning-based 
parameter identifier is developed to compensate for parametric 
uncertainty in the plant dynamics. Uniformly ultimately bounded 
(UUB) convergence of the system states to the origin, and UUB 
convergence of the developed policy to the optimal policy are 
established using a Lyapunov-based analysis. 

I. Introduction 

Reinforcement learning (RL) enables a cognitive agent to 
learn desirable behavior from interactions with its environ- 
ment. In control theory, the desirable behavior is typically 
quantified using a cost function, and the control problem 
is formulated as the desire to find the optimal policy that 
minimizes the cumulative cost. Recently, various RL-based 
techniques have been developed to approximately solve op- 
timal control problems for continuous-time and discrete-time 
deterministic systems ifTl-lfT?!. The approximate solution is 
facilitated via value function approximation, where the value 
function is approximated using a linear-in-the-parameters (LP) 
approximation, and the optimal policy is computed based on 
the estimated value function. 

Methods that seek an online solution to the optimal control 
problem, (cf., HI, ||4|) are structurally similar to adaptive 
control schemes. In adaptive control, the estimates for the 
uncertain parameters in the plant model are updated using 
the current tracking error as the performance metric, whereas,
in online RL-based techniques, estimates for the uncertain 
parameters in the value function are updated using the Bellman 
error (BE) as the performance metric. Convergence of online 
RL-based techniques to the optimal solution is analogous to 
parameter convergence in adaptive control. 

Parameter convergence has been a focus of research in 
adaptive control for several decades. It is common knowledge
that the least squares and gradient descent-based update laws
generally require persistence of excitation (PE) in the system 
state for convergence of the parameter estimates. Modification 
schemes such as projection algorithms, $\sigma$-modification, and
$e$-modification are used to guarantee boundedness of parameter
estimates and overall system stability. However, these
modification schemes do not guarantee parameter convergence 
unless the PE condition, which is often impossible to verify
online, is satisfied [14]-[17].

As recently shown in results such as [18] and [19], concurrent
learning methods can be used to guarantee parameter
convergence in adaptive control without relying on the PE 
condition. Concurrent learning relies on recorded state infor- 
mation along with current state measurements to update the 
parameter estimates. Learning from recorded data is effective 
since it is based on the model error, which is closely related to 
the parameter estimation error. The key concept that enables
the computation of the model error from past recorded data is 
that the model error can be computed if the state derivative is 
known, and the state derivative can be accurately computed 
at a past recorded data point using numerical smoothing 
techniques [18], [19].

In RL-based approximate online optimal control, parame- 
ter estimates are updated based on the BE along the state 
trajectories. Such weight update strategies create two chal- 
lenges for analyzing convergence. The system states need 
to be PE for parameter convergence, and the policy, which 
is based on the estimated weights, needs to regulate the 
system states to a neighborhood around the desired goal so 
the information around the desired trajectory can be used to 
learn the value function. For example, in an infinite horizon 
regulation problem, if the policy does not regulate the system 
states to a neighborhood around the origin, the optimal value 
function (and hence, the optimal policy) near the origin cannot
be learned, defeating one of the control objectives. These
challenges are typically addressed by adding an exploration 
signal to the control input (cf. (|4|, ifTJl . Il20l ) to ensure 
sufficient exploration in the desired region of the state space. 
However, no analytical methods exist to compute the appro- 
priate exploration signal for nonlinear systems. 

In this paper, the aforementioned challenges are addressed 
for an infinite horizon optimal regulation problem on a non- 
linear, control affine plant with LP uncertainties in the drift 
dynamics by observing that if the system dynamics are known, 
the BE can be computed at any desired point in the state 
space. Unknown parameters in the value function can therefore 
be adjusted based on least-squares minimization of the BE
evaluated at any number of desired points in the state space. 
For example, in an infinite horizon regulation problem, the 
BE can be computed at sampled points uniformly distributed 
in a neighborhood around the origin of the state space. The 
results of this paper indicate that convergence of the unknown 
parameters in the value function is guaranteed provided the 
selected points satisfy a rank condition that is weaker than 
the PE condition. Since the BE can be evaluated at any 
desired point in the state space, sufficient exploration can be 
achieved by appropriately selecting the points in a desired 
neighborhood. 

If the system dynamics are partially unknown, an approxi- 
mation to the BE can be evaluated at any desired point in the 
state space based on an estimate of the system dynamics. In 
this paper, a concurrent learning-based parameter estimator is 
developed to exponentially identify the unknown parameters 
in the system model, and the parameter estimates are used to 
compute an approximation to the BE. The unknown parame- 
ters in the value function are updated based on the approximate 
BE, and uniformly ultimately bounded (UUB) convergence 
of the system states to the origin and UUB convergence of 
the parameter estimates (and hence, UUB convergence of the 
developed policy to the optimal policy) are established using a
Lyapunov-based analysis. 

II. Problem Formulation 
Consider a control affine nonlinear dynamic system

$$\dot{x}(t) = f(x(t)) + g(x(t)) u(t), \quad t \in [0, \infty), \tag{1}$$

where $x \in \mathbb{R}^n$ denotes the system state, $u \in \mathbb{R}^m$ denotes the
control input, $f : \mathbb{R}^n \to \mathbb{R}^n$ denotes the drift dynamics, and
$g : \mathbb{R}^n \to \mathbb{R}^{n \times m}$ denotes the control effectiveness matrix. The
objective is to solve the infinite horizon optimal regulation
problem online, i.e., to find the optimal policy $u^* : \mathbb{R}^n \to \mathbb{R}^m$
defined as

$$u^* = \arg\min_{u \in U} \int_{t_0}^{\infty} r(x(\tau), u(x(\tau))) \, d\tau, \tag{2}$$

while regulating the system states to the origin. In (2), $U$
denotes the set of admissible state feedback policies, and
$r : \mathbb{R}^n \times \mathbb{R}^m \to [0, \infty)$ denotes the instantaneous cost defined
as

$$r(x, u) = x^T Q x + u^T R u,$$

where $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{m \times m}$ are constant positive
definite symmetric matrices. The class of nonlinear systems
considered in this paper is characterized by the following
assumption.

Assumption 1. The drift dynamics $f$ is an unknown, LP
locally Lipschitz function such that $f(0) = 0$, and the control
effectiveness matrix $g$ is a known, bounded locally Lipschitz
function.

A closed-form solution to the optimal control problem is
formulated in terms of the optimal value function
$V^* : \mathbb{R}^n \to [0, \infty)$ defined as

$$V^*(x_o) = \min_{u \in U} \int_{t_0}^{\infty} r(x(\tau), u(x(\tau))) \, d\tau, \quad \forall x_o \in \mathbb{R}^n, \tag{3}$$

where $x(\tau)$, $\tau \in [t_0, \infty)$ denotes the trajectory of (1) with
the feedback control law $u(x(\tau))$ and the initial condition
$x(t_0) = x_o$. Assuming that $V^*$ is continuously differentiable,
and $V^*(0) = 0$, the optimal control law can be determined as

$$u^*(x) = -\frac{1}{2} R^{-1} g^T(x) \left(\nabla_x V^*(x)\right)^T,$$

where $\nabla_x$ denotes the partial derivative with respect to $x$.

The optimal value function can be obtained by solving the
corresponding Hamilton-Jacobi-Bellman (HJB) equation

$$\nabla_x V^* (f + g u^*) + x^T Q x + u^{*T} R u^* = 0. \tag{4}$$

Analytical solution of the HJB equation is generally infeasible;
hence, an approximate solution is sought. An approximate
solution based on minimizing the BE is facilitated by replacing
$V^*$ and $u^*$ in (4) by their respective subsequently defined
estimates $\hat{V}$ and $\hat{u}$ to compute the BE
$\delta(\hat{V}, x, \hat{u}) \in \mathbb{R}$ as

$$\delta = \nabla_x \hat{V} (f + g \hat{u}) + x^T Q x + \hat{u}^T R \hat{u}. \tag{5}$$

The control objective is achieved by simultaneously adjusting
the estimates $\hat{V}$ and $\hat{u}$ to minimize the BE evaluated along
the trajectory $x(t)$. The BE depends on the drift dynamics
$f$. Since the drift dynamics are unknown, an adaptive system
identifier is developed in the following section.
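As a sanity check on the HJB equation (4) and the BE (5), the following worked example uses a hypothetical scalar linear-quadratic instance that is not taken from the paper. The constants $a$, $b$, $q$, $R$, $p$ and the candidate weight $\hat{w}$ are illustrative assumptions, and the standard stationarity condition $u^* = -\frac{1}{2} R^{-1} g^T (\nabla_x V^*)^T$ is used for the optimal control law.

```latex
\begin{align*}
\dot{x} &= a x + b u, \qquad r(x,u) = q x^2 + R u^2, \qquad V^*(x) = p x^2, \\
u^*(x) &= -\tfrac{1}{2} R^{-1} b \, (2 p x) = -\frac{b p}{R} x, \\
0 &= 2 p x \left( a x - \frac{b^2 p}{R} x \right) + q x^2 + \frac{b^2 p^2}{R} x^2
   = \left( 2 a p - \frac{b^2 p^2}{R} + q \right) x^2, \\
\delta &= \left( 2 a \hat{w} - \frac{b^2 \hat{w}^2}{R} + q \right) x^2
   \quad \text{for } \hat{V} = \hat{w} x^2, \; \hat{u} = -\frac{b \hat{w}}{R} x .
\end{align*}
```

In this special case the HJB equation reduces to a scalar Riccati equation in $p$, and the BE vanishes for all $x$ exactly when $\hat{w}$ solves that equation. The BE can also be evaluated at any $x$ once $a$ and $b$ are known, which is the observation the paper builds on.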

III. System Identification 

Let $f(x) = Y(x) \theta^*$ be the linear parametrization of the
function $f$, where $Y : \mathbb{R}^n \to \mathbb{R}^{n \times p}$ is the regression matrix
and $\theta^* \in \mathbb{R}^p$ is the vector of constant unknown parameters. Let
$\hat{f} : \mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}^n$ be an estimate of the unknown function
$f$ defined as $\hat{f}(x, \hat{\theta}) = Y(x) \hat{\theta}$, where $\hat{\theta}(t) \in \mathbb{R}^p$ is the
vector of parameter estimates. To estimate the drift dynamics,
an identifier is designed as

$$\dot{\hat{x}} = \hat{f} + g u + k_x \tilde{x}, \tag{6}$$

where the state estimation error $\tilde{x}$ is defined as $\tilde{x} = x - \hat{x}$ and
$k_x \in \mathbb{R}^{n \times n}$ is a positive definite, constant diagonal observer
gain matrix. From (1) and (6) the identification error dynamics
can be derived as

$$\dot{\tilde{x}} = Y \tilde{\theta} - k_x \tilde{x}, \tag{7}$$

where $\tilde{\theta}$ is the parameter identification error defined as
$\tilde{\theta} = \theta^* - \hat{\theta}$.



A. Concurrent learning-based parameter update

In traditional adaptive control, convergence of the estimates
$\hat{\theta}(t)$ to their true values $\theta^*$ is ensured by using a PE condition
[15]-[17]. To ensure convergence without the PE condition,
a concurrent learning-based approach can be employed [18],
[19]. The following observability assumption relaxes the PE
condition that is required for parameter convergence in
adaptive control.

Assumption 2. [18], [19] There exists a finite set of time
instances $\{t_j \mid j = 1, \dots, M\}$ such that

$$\operatorname{rank}\left( \sum_{j=1}^{M} Y_j^T Y_j \right) = p, \tag{8}$$

where $Y_j = Y(x(t_j))$.

The condition in (8) is satisfied as long as the system states
are exciting over a finite period of time, and hence, is weaker
than the PE condition. Furthermore, unlike the PE condition,
the rank condition in (8) can be verified online since it is a
function of past states. To design the concurrent learning-based
parameter update law, time instances $\{t_j \mid j = 1, \dots, M\}$ are
selected such that the condition in (8) holds, and the states
$\{x_j = x(t_j) \mid j = 1, \dots, M\}$ and the corresponding control
values $\{u_j = u(t_j) \mid j = 1, \dots, M\}$ are recorded in a history
stack. The update law is then designed as

$$\dot{\hat{\theta}} = \Gamma_\theta Y^T \tilde{x} + \Gamma_\theta k_\theta \sum_{j=1}^{M} Y_j^T \left( \dot{x}_j - g_j u_j - Y_j \hat{\theta} \right), \tag{9}$$

where $g_j = g(x_j)$, $\Gamma_\theta \in \mathbb{R}^{p \times p}$ is a constant positive
definite adaptation gain matrix, and $k_\theta \in \mathbb{R}$ is a constant positive
concurrent learning gain. The update law in (9) depends on
the unknown state derivative $\dot{x}_j = \dot{x}(t_j)$. However, since the
state derivative is from recorded data, numerical smoothing
techniques based on past and future data can be used to
obtain good estimates of the derivative. In the presence of
derivative estimation errors, the parameter estimation errors
can be shown to be UUB, where the size of the ultimate bound
depends on the error in the derivative estimate [19]. From (1)
and the definitions of $\tilde{\theta}$ and $\hat{f}$, the bracketed term in (9) can
be expressed as $\dot{x}_j - g_j u_j - Y_j \hat{\theta} = Y_j \tilde{\theta}$, and the parameter
update law in (9) can be expressed as

$$\dot{\hat{\theta}} = \Gamma_\theta Y^T \tilde{x} + \Gamma_\theta k_\theta \left( \sum_{j=1}^{M} Y_j^T Y_j \right) \tilde{\theta}. \tag{10}$$

B. Convergence analysis

Let $V_0 : \mathbb{R}^{n+p} \to [0, \infty)$ be a positive definite continuously
differentiable candidate Lyapunov function defined as

$$V_0 = \frac{1}{2} \tilde{x}^T \tilde{x} + \frac{1}{2} \tilde{\theta}^T \Gamma_\theta^{-1} \tilde{\theta}. \tag{11}$$

The following bounds on the Lyapunov function can be
established:

$$\underline{v} \|z\|^2 \le V_0 \le \bar{v} \|z\|^2, \tag{12}$$

where $\underline{v} = \frac{1}{2} \min(1, \underline{\gamma})$ and $\bar{v} = \frac{1}{2} \max(1, \bar{\gamma})$ are positive
known constants, $z = [\tilde{x}^T, \tilde{\theta}^T]^T \in \mathbb{R}^{n+p}$, and $\underline{\gamma}, \bar{\gamma} \in \mathbb{R}$
denote the minimum and the maximum eigenvalues of the
matrix $\Gamma_\theta^{-1}$.

The time derivative of the Lyapunov function is given by

$$\dot{V}_0 = \tilde{x}^T \dot{\tilde{x}} - \tilde{\theta}^T \Gamma_\theta^{-1} \dot{\hat{\theta}}. \tag{13}$$

Using (7) and (10), the Lyapunov derivative in (13) can be
expressed as

$$\dot{V}_0 = -\tilde{x}^T k_x \tilde{x} - k_\theta \tilde{\theta}^T \left( \sum_{j=1}^{M} Y_j^T Y_j \right) \tilde{\theta}. \tag{14}$$

Let $\underline{y} \in \mathbb{R}$ be the minimum eigenvalue of $\sum_{j=1}^{M} Y_j^T Y_j$.
Since $\sum_{j=1}^{M} Y_j^T Y_j$ is symmetric and positive semi-definite,
(8) can be used to conclude that it is also positive definite,
and hence $\underline{y} > 0$. Using (12), the Lyapunov derivative in (14)
can be expressed as

$$\dot{V}_0 \le -v \|z\|^2 \le -\frac{v}{\bar{v}} V_0. \tag{15}$$

In (15), $v = \min\{\underline{k}_x, \underline{y} k_\theta\} \in \mathbb{R}$, where $\underline{k}_x \in \mathbb{R}$ denotes
the minimum eigenvalue of the matrix $k_x$. The inequalities
in (15) can be used to conclude that $\|\tilde{\theta}(t)\| \to 0$ and
$\|\tilde{x}(t)\| \to 0$ exponentially fast. Provided the state trajectory
$x(t)$ is bounded$^1$, $\|Y(t)\| \in \mathcal{L}_\infty$. From (7),
$\|\dot{\tilde{x}}(t)\| \le \|Y(t)\| \|\tilde{\theta}(t)\| + \|k_x\| \|\tilde{x}(t)\|$, and hence,
$\|\dot{\tilde{x}}(t)\| \to 0$.

The concurrent learning-based observer results in exponential
regulation of the parameter and the state derivative
estimation errors. In the following, the parameter and state
derivative estimates are used to approximately solve the HJB
equation in (4) without the knowledge of the drift dynamics.

IV. Approximate Optimal Control

Based on the system identifier developed in Section III, the
BE in (5) can be approximated as

$$\hat{\delta}(x, \hat{u}, \hat{V}, \hat{\theta}) = \nabla_x \hat{V} \left( Y \hat{\theta} + g \hat{u} \right) + x^T Q x + \hat{u}^T R \hat{u}. \tag{16}$$

In the following, the approximate BE in (16) is used to obtain
an approximate solution to the HJB equation in (4).
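Before the identified model is used in the approximate BE, the identifier of Section III can be illustrated numerically. The following minimal sketch integrates the identifier (6) and the update law (9) with a forward-Euler scheme for a hypothetical scalar plant. The regressor $Y(x) = [x, x^2]$, the true parameters, the gains, the two recorded data points, and the assumption that the recorded state derivatives are known exactly are all illustrative choices, not values from the paper.

```python
# Hypothetical scalar plant: x_dot = Y(x) theta_star + u, with Y(x) = [x, x^2] and g = 1.
# All numerical values below are illustrative assumptions.

def regressor(x):
    """Regression row Y(x) for the assumed drift f(x) = theta1*x + theta2*x^2."""
    return [x, x * x]

def dot(a, b):
    """Inner product of two equally sized lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def history_gram(stack):
    """Sum of Y_j^T Y_j over the history stack (a 2x2 matrix here), as in (8)."""
    gram = [[0.0, 0.0], [0.0, 0.0]]
    for x_j, _, _ in stack:
        y_j = regressor(x_j)
        for r in range(2):
            for c in range(2):
                gram[r][c] += y_j[r] * y_j[c]
    return gram

def simulate(theta_star, stack, t_final=10.0, dt=1e-3,
             k_x=5.0, gamma=1.0, k_theta=1.0):
    """Forward-Euler integration of the plant, the identifier (6), and the update law (9)."""
    x, x_hat = 0.5, 0.0
    theta_hat = [0.0, 0.0]
    for _ in range(int(t_final / dt)):
        u = 0.0  # no probing signal is injected, so x(t) simply decays
        y = regressor(x)
        x_tilde = x - x_hat
        # Concurrent learning term: sum_j Y_j^T (x_dot_j - g_j u_j - Y_j theta_hat)
        cl = [0.0, 0.0]
        for x_j, u_j, x_dot_j in stack:
            y_j = regressor(x_j)
            err_j = x_dot_j - u_j - dot(y_j, theta_hat)
            cl = [c + yi * err_j for c, yi in zip(cl, y_j)]
        theta_dot = [gamma * (yi * x_tilde + k_theta * ci) for yi, ci in zip(y, cl)]
        x_dot = dot(y, theta_star) + u
        x_hat_dot = dot(y, theta_hat) + u + k_x * x_tilde
        x += dt * x_dot
        x_hat += dt * x_hat_dot
        theta_hat = [th + dt * td for th, td in zip(theta_hat, theta_dot)]
    return theta_hat, x - x_hat

theta_star = [-1.0, -0.5]
# Recorded (x_j, u_j, x_dot_j) triples; the derivative is assumed to be known exactly.
stack = [(x_j, 0.0, dot(regressor(x_j), theta_star)) for x_j in (1.0, -1.0)]
gram = history_gram(stack)
rank_ok = gram[0][0] * gram[1][1] - gram[0][1] * gram[1][0] > 0.0  # full rank for p = 2
theta_hat, x_tilde = simulate(theta_star, stack)
```

In this sketch the state decays to the origin, so the regressor along the trajectory is not persistently exciting. The estimates still approach `theta_star` because the two recorded points make the matrix in (8) full rank.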

A. Value function approximation 

Approximations to the optimal value function $V^*$ and the
optimal policy $u^*$ are designed based on neural network
(NN)-based representations. The NN-based representation is
facilitated by a temporary assumption that the state trajectory
$x(t)$ evolves on a compact subspace $\chi \subset \mathbb{R}^n$. The compactness
assumption is common in NN-based adaptive
control (cf. [21], [22]), and it is shown in the subsequent
stability analysis that the states evolve on a compact set
provided the initial condition $x(t_0)$ is bounded (see Remark
1). The following standard
NN assumption describes an NN-based representation of the
optimal value function.

$^1$Remark 1 in the subsequent analysis shows that $x(t) \in \mathcal{L}_\infty$.



Assumption 3. On the compact set $\chi$, the optimal value
function $V^*$ can be represented using a NN as

$$V^* = W^{*T} \sigma + \epsilon, \tag{17}$$

where $W^* \in \mathbb{R}^L$ is the ideal weight matrix, which is bounded
above by a known positive constant $\bar{W}$ in the sense that
$\|W^*\| \le \bar{W}$, $\sigma = [\sigma_1 \cdots \sigma_L]^T : \chi \to \mathbb{R}^L$ is a bounded
continuously differentiable nonlinear activation function such
that $\sigma(0) = 0$ and $\sigma'(0) = 0$, $L \in \mathbb{N}$ is the number of neurons,
and $\epsilon : \chi \to \mathbb{R}$ is the function reconstruction error such that
$\sup_{x \in \chi} |\epsilon(x)| \le \bar{\epsilon}$ and $\sup_{x \in \chi} |\epsilon'(x)| \le \bar{\epsilon}'$, where $\bar{\epsilon}, \bar{\epsilon}' \in \mathbb{R}$
are known positive constants.

Based on (17), a NN-based representation of the optimal
controller is derived as

$$u^* = -\frac{1}{2} R^{-1} g^T \left( \sigma'^T W^* + \epsilon'^T \right). \tag{18}$$

The NN-based approximations to the optimal value function
in (17) and the optimal policy in (18) are defined as

$$\hat{V} = \hat{W}_c^T \sigma, \qquad \hat{u} = -\frac{1}{2} R^{-1} g^T \sigma'^T \hat{W}_a, \tag{19}$$

where $\hat{W}_c(t) \in \mathbb{R}^L$ and $\hat{W}_a(t) \in \mathbb{R}^L$ are estimates of the
ideal weights $W^*$. The use of two sets of weights to estimate
the same set of ideal weights is motivated by the stability
analysis and the fact that it enables a formulation of the BE that
is linear in the value function weight estimates $\hat{W}_c$, enabling
a least squares-based adaptive update law.

Based on (19), the approximate BE in (16) can be expressed
as

$$\hat{\delta}(x, \hat{W}_a, \hat{W}_c, \hat{\theta}) = \omega^T \hat{W}_c + x^T Q x + \hat{u}^T R \hat{u}, \tag{20}$$

where $\omega(x, \hat{\theta}, \hat{W}_a) = \sigma'(x) \left[ \hat{f}(x, \hat{\theta}) + g(x) \hat{u}(x, \hat{W}_a) \right]$
is the regressor vector.
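The structure of (19) and (20) can be made concrete with a short numerical sketch. The example below evaluates the policy estimate, the regressor $\omega$, and the approximate BE at an arbitrary point for a hypothetical scalar system. The basis $\sigma(x) = [x^2, x^4]$, the drift regressor $Y(x) = [x, x^3]$, and the scalar values of $g$, $Q$, and $R$ are assumptions chosen only for illustration.

```python
def sigma_grad(x):
    """Gradient of the assumed basis sigma(x) = [x^2, x^4]; sigma(0) = 0 and sigma'(0) = 0."""
    return [2.0 * x, 4.0 * x ** 3]

def regressor(x):
    """Assumed drift regressor Y(x) = [x, x^3], so f_hat(x) = Y(x) theta_hat."""
    return [x, x ** 3]

def dot(a, b):
    """Inner product of two equally sized lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def policy(x, w_a, g=1.0, r=1.0):
    """Policy estimate u_hat = -1/2 R^-1 g^T sigma'^T W_a from (19), for scalar g and R."""
    return -0.5 * g / r * dot(sigma_grad(x), w_a)

def bellman_error(x, w_c, w_a, theta_hat, g=1.0, q=1.0, r=1.0):
    """Approximate BE (20): omega^T W_c + x^T Q x + u_hat^T R u_hat. Returns (delta, omega)."""
    u_hat = policy(x, w_a, g, r)
    f_hat = dot(regressor(x), theta_hat)
    omega = [s * (f_hat + g * u_hat) for s in sigma_grad(x)]
    return dot(omega, w_c) + q * x * x + r * u_hat * u_hat, omega

# The BE can be evaluated at any point, not only along a measured trajectory.
delta, omega = bellman_error(1.0, [0.5, 0.0], [0.5, 0.0], [-1.0, 0.0])
```

Because $\omega$ depends only on the policy weights and the model estimate, the BE is affine in the value function weights. Evaluating it with zero value weights and subtracting recovers exactly `dot(omega, w_c)`.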



B. Learning based on desired behavior 

In traditional RL-based algorithms, the value function esti- 
mate and the policy estimate are updated based on observed 
data. The use of observed data to learn the value function 
naturally leads to a sufficient exploration condition which 
demands sufficient richness in the observed data. In stochastic 
systems, this is achieved using a randomized stationary policy 
(cf. [131, fSOl, fSSl), whereas in deterministic systems, a 
probing noise is added to the derived control law (cf. [1]- 
0, Q, ED)- The technique developed in this result is 
based on the observation that if an estimate of the system 
dynamics is available, an approximation to the BE can be 
evaluated at any desired point in the state space. The following 
condition, similar to the condition in (8), enables the use
of approximate BE evaluated at a pre-sampled set of points
$\{x_i \in \chi \mid i = 1, \dots, N\}$ in the state space.

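Assumption 4. There exists a set of points
$\{x_i \in \chi \mid i = 1, \dots, N\}$ such that $\forall t \in [0, \infty)$,

$$\operatorname{rank}\left( \sum_{i=1}^{N} \frac{\omega_i \omega_i^T}{\rho_i} \right) = L. \tag{21}$$

In (21), $\rho_i = 1 + \nu \omega_i^T \Gamma \omega_i \in \mathbb{R}$ are the normalization terms,
where $\nu \in \mathbb{R}$ is a constant positive normalization gain,
$\Gamma(t) \in \mathbb{R}^{L \times L}$ is the least-squares gain matrix, and
$\omega_i(x_i, \hat{\theta}, \hat{W}_a) = \sigma'(x_i) \left[ \hat{f}(x_i, \hat{\theta}) + g(x_i) \hat{u}(x_i, \hat{W}_a) \right]$.

To facilitate the stability analysis, let

$$c = \frac{1}{N} \inf_{t \in [0, \infty)} \lambda_{\min}\left\{ \sum_{i=1}^{N} \frac{\omega_i \omega_i^T}{\rho_i} \right\}, \tag{22}$$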

where $\lambda_{\min}\{\cdot\}$ denotes the minimum eigenvalue. Provided
Assumption 4 holds, the infimum in (22) is positive. The
condition in (21) is weaker than the PE condition in previous
results such as |l|-f3l, |T|, fSTl and unlike the PE condition, 
(21) can be verified online. Since the rank condition in (21)
depends on the estimates $\hat{\theta}(t)$ and $\hat{W}_a(t)$, it is, in general,
impossible to guarantee a priori. However, heuristically, the
condition in (21) can be met by collecting redundant data,
i.e., by selecting more points than the number of neurons by
choosing $N \gg L$.
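Because the regressors at the sampled points are computable from the current estimates, the rank condition can be checked numerically at any time. The following sketch forms the normalized sum of outer products and tests its rank by Gaussian elimination. The function names, the tolerance, and the example regressor vectors are hypothetical. In an online setting the check would have to be repeated as $\hat{\theta}(t)$ and $\hat{W}_a(t)$ evolve.

```python
def outer_sum(omegas, nu, gamma):
    """Return sum_i omega_i omega_i^T / rho_i with rho_i = 1 + nu * omega_i^T Gamma omega_i."""
    n = len(omegas[0])
    total = [[0.0] * n for _ in range(n)]
    for w in omegas:
        gw = [sum(gamma[r][c] * w[c] for c in range(n)) for r in range(n)]
        rho = 1.0 + nu * sum(w[r] * gw[r] for r in range(n))
        for r in range(n):
            for c in range(n):
                total[r][c] += w[r] * w[c] / rho
    return total

def matrix_rank(mat, tol=1e-9):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in mat]
    rows, cols = len(a), len(a[0])
    rank = 0
    for col in range(cols):
        if rank == rows:
            break
        pivot = max(range(rank, rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank

def rank_condition_holds(omegas, nu, gamma):
    """True if the normalized regressor sum has full rank L = len(omega_i)."""
    return matrix_rank(outer_sum(omegas, nu, gamma)) == len(omegas[0])

identity = [[1.0, 0.0], [0.0, 1.0]]
rich = rank_condition_holds([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], 1.0, identity)
poor = rank_condition_holds([[1.0, 2.0], [2.0, 4.0]], 1.0, identity)
```

Here `rich` uses three regressors that span the plane, while `poor` uses two collinear regressors, which cannot satisfy the condition regardless of how many such points are added.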

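The approximate BE can be evaluated at the sampled points
$\{x_i \mid i = 1, \dots, N\}$ as

$$\hat{\delta}_i(x_i, \hat{W}_a, \hat{W}_c, \hat{\theta}) = \omega_i^T \hat{W}_c + x_i^T Q x_i + \hat{u}_i^T R \hat{u}_i, \tag{23}$$

where $\hat{u}_i(x_i, \hat{W}_a) = -\frac{1}{2} R^{-1} g(x_i)^T \sigma'(x_i)^T \hat{W}_a$. A concurrent
learning-based least-squares update law for the value
function weights is designed based on the subsequent stability
analysis as

$$\dot{\hat{W}}_c = -\eta_{c1} \Gamma \frac{\omega}{\rho} \hat{\delta} - \frac{\eta_{c2}}{N} \Gamma \sum_{i=1}^{N} \frac{\omega_i}{\rho_i} \hat{\delta}_i,$$

$$\dot{\Gamma} = \left( \beta \Gamma - \eta_{c1} \Gamma \frac{\omega \omega^T}{\rho^2} \Gamma \right) \mathbf{1}_{\{\|\Gamma\| \le \bar{\Gamma}\}}, \quad \|\Gamma(t_0)\| \le \bar{\Gamma}, \tag{24}$$

where $\mathbf{1}_{\{\cdot\}}$ denotes the indicator function, $\bar{\Gamma} > 0 \in \mathbb{R}$ is the
saturation constant, $\beta > 0 \in \mathbb{R}$ is the forgetting factor, and
$\eta_{c1}, \eta_{c2} > 0 \in \mathbb{R}$ are constant adaptation gains. The update
law in (24) ensures that the adaptation gain matrix is bounded
such that

$$\underline{\Gamma} \le \|\Gamma(t)\| \le \bar{\Gamma}, \quad \forall t \in [0, \infty). \tag{25}$$

The policy weights are then updated to follow the value
function weights as

$$\dot{\hat{W}}_a = -\eta_{a1} \left( \hat{W}_a - \hat{W}_c \right) - \eta_{a2} \hat{W}_a + \left( \frac{\eta_{c1} G_\sigma^T \hat{W}_a \omega^T}{4 \rho} + \sum_{i=1}^{N} \frac{\eta_{c2} G_{\sigma i}^T \hat{W}_a \omega_i^T}{4 N \rho_i} \right) \hat{W}_c, \tag{26}$$

where $\eta_{a1}, \eta_{a2} \in \mathbb{R}$ are positive constant adaptation gains and
$G_\sigma = \sigma' g R^{-1} g^T \sigma'^T \in \mathbb{R}^{L \times L}$.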

The update law in (24) is fundamentally different from the
concurrent learning adaptive update in results such as [18],
[19], in the sense that the points $\{x_i \in \chi \mid i = 1, \dots, N\}$ are
selected a priori based on prior information about the desired
behavior of the system. For example, in the present case, since
the objective is to regulate the system states to the origin and
the system is deterministic, it is natural to select a bounded set 
of points uniformly distributed around the origin of the state 
space. This difference is a result of the fact that the developed 
RL-based technique uses the approximate BE as the metric to 
update the weight estimates. Given the system dynamics, or 
an estimate of the system dynamics, the approximate BE can 
be evaluated at any desired point in the state space, whereas 
in adaptive control, the prediction error is used as a metric 
which can only be evaluated at observed data points along the 
state trajectory. 
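The idea of learning from the BE at pre-selected points can be sketched on a hypothetical scalar linear-quadratic problem with known dynamics. The code below is a deliberately simplified variant of the update laws, not the laws of the paper: the least-squares gain is fixed at $\Gamma = 1$, only the sampled-point term of the value weight update is kept, and the policy weight simply tracks the value weight. The plant constants, gains, sample points, and single basis function $\sigma(x) = x^2$ are assumptions chosen for illustration.

```python
import math

def learn_value_weight(a=-1.0, b=1.0, q=1.0, r=1.0,
                       points=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0),
                       eta_c=5.0, eta_a=2.0, nu=1.0, dt=0.01, t_final=20.0):
    """Simplified critic/actor updates driven only by the BE at pre-sampled points."""
    w_c, w_a = 1.0, 1.0  # assumed initial weight guesses
    n = len(points)
    for _ in range(int(t_final / dt)):
        w_c_dot = 0.0
        for x_i in points:
            u_i = -0.5 / r * b * (2.0 * x_i) * w_a               # policy with sigma'(x) = 2x
            omega_i = 2.0 * x_i * (a * x_i + b * u_i)            # regressor sigma'(x_i)(f + g u_i)
            delta_i = omega_i * w_c + q * x_i ** 2 + r * u_i ** 2  # BE at the sampled point
            rho_i = 1.0 + nu * omega_i ** 2                      # normalization with Gamma = 1
            w_c_dot -= eta_c / n * omega_i / rho_i * delta_i
        w_a_dot = -eta_a * (w_a - w_c)                           # actor follows critic
        w_c += dt * w_c_dot
        w_a += dt * w_a_dot
    return w_c, w_a

def riccati_weight(a=-1.0, b=1.0, q=1.0, r=1.0):
    """Positive root p of 2 a p - b^2 p^2 / r + q = 0, so that V*(x) = p x^2."""
    return r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)

w_c, w_a = learn_value_weight()
p = riccati_weight()
```

No state trajectory is simulated in this sketch: the weights are driven purely by the BE evaluated at the fixed points around the origin, and both `w_c` and `w_a` settle near the Riccati solution `p` for these particular values.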



V. Stability analysis 

To facilitate the subsequent stability analysis, the approximate
BE is expressed in terms of the weight estimation errors
$\tilde{W}_c = W^* - \hat{W}_c$ and $\tilde{W}_a = W^* - \hat{W}_a$. Subtracting (4) from
(20), the unmeasurable form of the instantaneous BE can be
expressed as



facilitate the subsequent analysis 



^1^ 



VciLfe' 
4^^^ ' 



N 



^2^E 



( vc2\Kmw 



^3^ 



LYVciW\\a'\ 
4aAT 



^4^ 



J«. 



^5^ 



Tjciuj {2W'^a'G€'^ + G, 



Ap 



N 

E 



■nc2^i^i 



Np, 



hvTQ, + ie'G^CT'^ 



^7^ 



•nci\\Gc\ 



N 

E 



11c2 \\Gcri\\ 

8iv^^^ 



The main result of this paper can now be stated as follows. 

Theorem 1. Provided Assumptions 2-4 hold and gains
$q$, $\eta_{c2}$, $\eta_{a2}$, and $k_\theta$ are selected large enough based on the
following sufficient conditions



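$$\hat{\delta} = \nabla_x \hat{V} \left( \hat{f} + g \hat{u} \right) + \hat{u}^T R \hat{u} - \nabla_x V^* \left( f + g u^* \right) - u^{*T} R u^*$$

$$= -\omega^T \tilde{W}_c - W^{*T} \sigma' Y \tilde{\theta} + \frac{1}{4} \tilde{W}_a^T G_\sigma \tilde{W}_a + \frac{1}{4} G_\epsilon - \epsilon' f + \frac{1}{2} W^{*T} \sigma' G \epsilon'^T, \tag{27}$$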



Va2>-—+^7W^-^^ 



ke > 



q > ^1, ?7c2 > 



^2 + Ci^^z 
yCi ' 

Ca^rW + 7jai + 2 {^i + Ci^2 + ^sZ) 



2c 



(31) 



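where $G = g R^{-1} g^T \in \mathbb{R}^{n \times n}$ and $G_\epsilon = \epsilon' G \epsilon'^T \in \mathbb{R}$.
Similarly, the approximate BE evaluated at the sampled states
$\{x_i \mid i = 1, \dots, N\}$ can be expressed as

$$\hat{\delta}_i = -\omega_i^T \tilde{W}_c + \frac{1}{4} \tilde{W}_a^T G_{\sigma i} \tilde{W}_a - W^{*T} \sigma_i' Y_i \tilde{\theta} + \Delta_i, \tag{28}$$

where $\Delta_i = \frac{1}{2} W^{*T} \sigma_i' G_i \epsilon_i'^T + \frac{1}{4} G_{\epsilon i} - \epsilon_i' f_i \in \mathbb{R}$ is a constant.
On the compact set $\chi$ the functions $f$ and $Y$ are Lipschitz
continuous, and hence, there exist positive constants $L_f, L_Y \in \mathbb{R}$
such that$^2$

$$\|f(x)\| \le L_f \|x\|, \quad \|Y(x)\| \le L_Y \|x\|, \quad \forall x \in \chi. \tag{29}$$

Using (25), the normalized regressor $\frac{\omega}{\rho}$ can be bounded as

$$\left\| \frac{\omega(x)}{\rho(x)} \right\| \le \frac{1}{2 \sqrt{\nu \underline{\Gamma}}}, \tag{30}$$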



where the operator (•) : [0, oo) -^ [0, oo) is defined as (•) = 
sup2,gjj„ (•). The following positive constants are defined to 



where $Z(t_0) \in \mathbb{R}$ is a positive constant that depends on
the initial condition of the system, the observer in (6) along
with the adaptive update law in (9) and the controller in
(19) along with the adaptive update laws in (24) and (26)
ensure that the state $x(t)$, the state estimation error $\tilde{x}(t)$,
the parameter estimation error $\tilde{\theta}(t)$, the value function weight
estimation error $\tilde{W}_c(t)$ and the policy weight estimation error
$\tilde{W}_a(t)$ are UUB, resulting in UUB convergence of the policy
$\hat{u}(x(t), \hat{W}_a(t))$ to the optimal policy $u^*(x(t))$.



Proof: Let $V_L : \mathbb{R}^{2n+2L+p} \times [0, \infty) \to [0, \infty)$ be a
continuously differentiable positive definite candidate Lyapunov
function defined as

$$V_L = V^* + \frac{1}{2} \tilde{W}_c^T \Gamma^{-1} \tilde{W}_c + \frac{1}{2} \tilde{W}_a^T \tilde{W}_a + V_0, \tag{32}$$

where $V^*$ is the optimal value function, and $V_0$ was introduced
in (11). Using the fact that $V^*$ is positive definite, (12)
and Lemma 4.3 from [25] yield

$$\underline{v_l}(\|Z\|) \le V_L(Z, t) \le \bar{v_l}(\|Z\|), \tag{33}$$

for all $t \in [0, \infty)$ and for all $Z \in \mathbb{R}^{2n+2L+p}$, where
$\underline{v_l}, \bar{v_l} : [0, \infty) \to [0, \infty)$ are class $\mathcal{K}$ functions and
$Z = [x^T, \tilde{W}_c^T, \tilde{W}_a^T, \tilde{x}^T, \tilde{\theta}^T]^T$.

$^2$The Lipschitz property is exploited here for clarity of exposition. The
bounds in (29) can be easily generalized to $\|f(x)\| \le L_f(\|x\|) \|x\|$,
$\|Y(x)\| \le L_Y(\|x\|) \|x\|$, where $L_f, L_Y$ are positive,
nondecreasing functions.



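The time derivative of (32) along the trajectories of (1), (7),
(10), (24), and (26) is given by

$$\dot{V}_L = \dot{V}^* - \tilde{W}_c^T \Gamma^{-1} \dot{\hat{W}}_c - \frac{1}{2} \tilde{W}_c^T \Gamma^{-1} \dot{\Gamma} \Gamma^{-1} \tilde{W}_c - \tilde{W}_a^T \dot{\hat{W}}_a + \dot{V}_0,$$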



\ P N 



N \ 



i VFjr-i f /3r - 77,1 f r^:^r ) ) r-^w, - i^Ki 



p 



Wl [-Tlal {Wa - VKc) - Va2Wa) 



~T UclGlWaUJ-r 



N 



4p 



y^ ric2Gl,WaUjJ ^ 



i=l 



^Np, 



N 



Substituting for the approximate BEs from (27) and (28), using
the bounds in (29) and (30), and using Young's inequality, the
Lyapunov derivative can be upper-bounded as



Vl < -ix \\x\\ - L. 
2 



ie 



+ t?5 






A 



Wa 
Wa 



fcx ||2;| 



^4 



where l^, ia, i-c (t), i-e (t) G ^ are defined as 

A q A '7q1 , q tI7 /^ ^^2 + 1 

/,x A Q A Q C2^?7W^ + Val , II /+MI 

i-c[t) =?7c2C-'i?i -C1W2 ^ ■d3\\x[t)\\ , 

Provided the conditions in (31) are satisfied, the inequalities



ric2 > 



C2^7W + Val+2 i^i + Clt^2 + ^?3 \\X (t)||) 

2c 



y 



(34) 



hold for all $t \in [0, \infty)$ (see Remark 1), and hence, the
coefficients $\iota_x$, $\iota_a$, $\iota_c(t)$, and $\iota_\theta(t)$ are positive for all
$t \in [0, \infty)$. Completing the squares, the Lyapunov derivative can
be expressed as



k^m' 



VL<-i 


X \ 


l|2 ^c 

x\\ -^ 


w. 


2 


" 2 


Wa 






2 








- Le 





+ t, 























<-vi\\Z\\, V||Z||>t>0, 



(35) 



where '^=2f" + 2f"'^^4^^ ^^^ ^'i± G IR are the lower 



bounds on $\iota_c(t)$ and $\iota_\theta(t)$, respectively. Theorem 4.18 in [25]
can now be invoked to conclude that $Z(t)$ is UUB. ■

Remark 1. If $\|Z(0)\| \ge \iota$ then $\dot{V}_L(Z(0)) < 0$. Thus,
$V_L(Z(t))$ is decreasing at $t = 0$, and hence, $Z(t) \in \mathcal{L}_\infty$
at $t = 0^+$. Thus all the conditions of Theorem 1 are satisfied
at $t = 0^+$. As a result, $V_L(Z(t))$ is decreasing at $t = 0^+$. By
induction, $\|Z(0)\| \ge \iota \implies V_L(Z(t)) \le V_L(Z(0)), \forall t \in \mathbb{R}_+$.
Thus, from (33), $\|Z(t)\| \le \underline{v_l}^{-1}(\bar{v_l}(\|Z(0)\|)), \forall t \in [0, \infty)$.
If $\|Z(0)\| < \iota$ then (33) and (35) can be used to determine
that $\underline{v_l}(\|Z(t)\|) \le V_L(Z(t)) \le \bar{v_l}(\iota), \forall t \in [0, \infty)$.
As a result, $\|Z(t)\| \le \underline{v_l}^{-1}(\bar{v_l}(\iota)), \forall t \in [0, \infty)$. Let the
positive constant $\bar{Z} \in \mathbb{R}$ be defined as

$$\bar{Z} = \underline{v_l}^{-1}\left( \bar{v_l}\left( \max\left( \|Z(0)\|, \iota \right) \right) \right).$$

This relieves the compactness assumption in the sense that
the compact set $\chi \subset \mathbb{R}^n$ that contains the system trajectories
$x(t), \forall t \in [0, \infty)$ is given by $\chi = \{x \in \mathbb{R}^n \mid \|x\| \le \bar{Z}\}$.
Furthermore, $\|Z(t)\| \le \bar{Z}, \forall t \in [0, \infty)$ implies that the gain
conditions in (31) are sufficient for the inequalities in (34) to
hold for all $t \in [0, \infty)$.

VI. Conclusion 

An RL-based online approximate optimal controller is developed
that does not require PE for convergence. The PE
condition is replaced by a weaker rank condition that can be 
verified online from recorded data. An approximation to the 
BE computed at pre-sampled desired values of the system state 
using an estimate of the system dynamics is used to improve 
the value function approximation, and UUB convergence of 
the system states to the origin, and UUB convergence of the 
policy to the optimal policy are established using a Lyapunov- 
based analysis.